Abstract

Did celebrity last longer in 1929, 1992 or 2009? We 
investigate the phenomenon of fame by mining a col- 
lection of news articles that spans the twentieth cen- 
tury, and also perform a side study on a collection 
of blog posts from the last 10 years. By analyzing 
mentions of personal names, we measure each per- 
son's time in the spotlight, using two simple met- 
rics that evaluate, roughly, the duration of a sin- 
gle news story about a person, and the overall du- 
ration of public interest in a person. We watched 
the distribution evolve from 1895 to 2010, expect- 
ing to find significantly shortening fame durations, 
per the much-bemoaned shortening of society's
attention spans and quickening of the media's news
cycles. Instead, we conclusively demonstrate that,
through many decades of rapid technological and so- 
cietal change, through the appearance of Twitter, 
communication satellites, and the Internet, fame du- 
rations did not decrease, neither for the typical case 
nor for the extremely famous, with the last statis- 
tically significant fame duration decreases coming in 
the early 20th century, perhaps from the spread of 
telegraphy and telephony. Furthermore, while me- 
dian fame durations stayed persistently constant, for 
the most famous of the famous, as measured by either 
volume or duration of media attention, fame dura- 
tions have actually trended gently upward since the 
1940s, with statistically significant increases on 40- 
year timescales. Similar studies have been done with 
much shorter timescales specifically in the context of
information spreading on Twitter and similar social 
networking sites. To the best of our knowledge, this is 
the first massive scale study of this nature that spans 
over a century of archived data, thereby allowing us 
to track changes across decades. 

Keywords: culturomics, media, attention modeling, 
social media, time series, historical trends, fame du- 
ration, news archives 



1 Introduction 

Beginning in the 19th century, long-distance commu- 
nication transitioned from foot to telegraph on land, 
and from sail to steam to cable by sea. Each new 
form of technology began with a limited number of 
dedicated routes, then expanded to reach a large frac- 
tion of the accessible audience, eventually resulting in 
near-complete deployment of digital electronic com- 
munication. Each transition represented an opportu- 
nity for news to travel faster, break more uniformly, 
and reach a broad audience closer to its time of in- 
ception. 

Even today, the increasing speed of the news cy- 
cle is a common theme in discussions of the soci- 
etal implications of technology. Stories break faster, 
are covered in less detail, and news sources quickly 
move on to other topics. Online and cable outlets ag- 
gressively search for novelty in order to keep eyeballs 
glued to screens. Popular non-fiction dedicates signif- 
icant coverage to this trend, which by 2007 prompted 
The Onion, a satirical website, to offer the following
commentary on cable news provider CNN's
offerings: "CNN is widely credited with initiating
the acceleration of the modern news cycle with the 
fall 2006 debut of its spin-off channel CNN:24, which 
provides a breaking news story, an update on that 
story, and a news recap all within 24 seconds." 

With this speed-up of the news cycle comes an as- 
sociated concern that, whether or not causality is at 
play, attention spans are shorter, and consumers are 
only able to focus for progressively briefer periods on 
any one news subject. Stories that might previously 
have occupied several days of popular attention might 
emerge, run their course, and vanish in a single day. 
This popular theory is consistent with a suggestion 
by Herbert Simon [10] that as the world grows rich in
information, the attention necessary to process that 
information becomes a scarce and valuable resource. 

The speed of the news cycle is a difficult concept 
to pin down. We focus our study on the most com- 
mon object of news: the individual. An individual's 
fame on a particular day might be thought of as the 
probability with which a reader picking up a news 
article at random would see their name. From this 
idea we develop two notions of the duration of the in- 
terval when an individual is in the news. The first is 
based on fall-off from a peak, and intends to capture 
the spike around a concrete, narrowly-defined news 
story. The second looks for a period of sustained
public interest in an individual, from the time the public
first notices that person's existence until the public 
loses interest and the name stops appearing in the 
news. We study the interaction of these two notions 
of "duration of fame" with the radical shifts in the 
news cycle we outline above. For this purpose, we 
employ Google's public news archive corpus, which 
contains over sixty million pages covering 250 years, 
and we perform what we believe to be the first study 
of the dynamics of fame over such a time period. 

Data within the archive is heterogeneous in nature, 
ranging from directly captured digital content to op- 
tical character recognition employed against micro- 
film representations of old newspapers. The crawl is 
not complete, and we do not have full information 
about which items are missing. Rather than attempt 
topic detection and tracking in this error-prone
environment, we instead directly apply a recognizer for
person names to all content within the corpus; this
approach is more robust, and more aligned with our 
goal of studying fame of individuals. 

Based on these different notions of periods of
reference to a particular person, we develop at each point
in time a distribution over the duration of fame of 
different individuals. 

Our expectation upon undertaking this study was 
that in early periods, improvements to communica- 
tion would cause the distribution of duration of cov- 
erage of a particular person to shrink over time. We 
hypothesized that, through the 20th century, the con- 
tinued deployment of technology, and the changes to 
modern journalism resulting from competition to of- 
fer more news faster, would result in a continuous 
shrinking of fame durations, over the course of the 
century into the present day. 

Summary of findings.

We did indeed observe fame durations shortening 
somewhat in the early 20th century, in line with our 
hypothesis regarding accelerating communications. 
However, from 1940 to 2010, we see quite a differ- 
ent picture. Over the course of 70 years, through a 
world war, a global depression, a growth of two orders of
magnitude in (available) media volume, and a technological
curve moving from party-line telephones to
satellites and Twitter, both of our fame duration met- 
rics showed that neither the typical person in 
the news, i.e. the median fame duration, nor 
the most famous, i.e. high-volume or long- 
duration outliers, experienced any statistically 
significant decrease in fame durations. 

As a matter of fact, the bulk of the distribu- 
tion, as characterized by median fame dura- 
tions, stayed constant throughout the entire 
century-long span of the news study and was 
also the same through the decade of Blogger 
posts on which we ran the same experiments. 
As another heuristic characterization of the bulk of 
the distribution, both news and Blogger data pro- 
duced roughly comparable parameters when fitted to 
a power law: an exponent of around -2.5, although 
with substantial error bars, suggesting that the fits 
were mediocre. 

Furthermore, when we focused our attention on the 
very famous, by various definitions, all signs pointed 
to a slow but observable growth in fame durations. 
From 1940 onward, on the scale of 40-year in- 
tervals, we found statistically significant fame 
duration growth for the "very famous", defined
as either: 

• people whose fame lasts exceptionally long: 90th 
and 99th percentiles of fame duration distribu- 
tions; or 

• exceptionally highly-discussed people: using
distributions among just the top 1000 people or the
top 0.1% of people by number of mentions within
each year.

Figure 1: The volume of news articles by date.

In the case of taking the 1000 most-often-mentioned
names in each year, the increase could
be explained as follows: as the corpus increases in 
volume toward later years, a larger number of names 
appear, representing more draws from the same un- 
derlying distribution of fame durations. The quan- 
tiles of the distribution of duration for the top 1000 
elements will therefore grow over time as the corpus 
volume increases. On the other hand, our experi- 
ments that took the top 0.1% most-often-mentioned 
names, or the top quantiles of duration, still showed 
an increasing trend. We therefore conclude that the 
increasing trend is not completely caused by an in- 
crease in corpus volume. 

To summarize, we find that the most famous figures 
in today's news stay in the limelight for longer than 
their counterparts did in the past. At the same time, 
however, the average newsworthy person remains in 
the limelight for essentially the same amount of time 
today as in the past. 

2 Working with the news corpus

We perform our main study on a collection of the 
more than 60 million news articles in the Google 
archive that are both (1) in English, and (2) search- 
able and readable by Google News users at no cost. In 
Section [5] we cross-validate our observations against
the corpus of public blog posts on Blogger, which is
described there. 

The articles of the news corpus span a wide range 
of time, with the relative daily volume of articles over 
the range of the corpus shown in Figure [1]. There are a
handful of articles from the late 18th century onward, 
and the article coverage grows rapidly over the course 
of the 19th century. From the last decade of the 19th 
century through the end of the corpus (March 2011), 
there is consistently a very substantial volume of ar- 
ticles per day, as well as a wide diversity of publi- 
cations. For the sake of statistical significance, our 
study focuses on the years 1895-2011. 

The news corpus contains a mix of modern arti- 
cles obtained from the publisher in the original dig- 
ital form, as well as historical articles scanned from 
archival microform and OCRed, both by Google and 
by third parties. For scanned articles, per-article 
metadata such as titles, issue dates, and boundaries 
between articles are also derived algorithmically from 
the OCRed data, rather than manually curated. 

Our study design was driven by several features 
that we discovered in this massive corpus. We list 
them here to explain our study design. Also, data 
mining for high-level behavioral patterns in a di- 
achronous, heterogeneous, partially-OCRed corpus of 
this scale is quite new, precedented on this scale per- 
haps only by [S] (which brands this new area as "cul- 
turomics"). But, with the rapid digitization of his- 
torical data, we expect such work to boom in the near 
future. We thus hope that the lessons we have learned 
about this corpus will also be of independent inter- 
est to others examining this corpus and other similar 
archive corpuses. 

2.1 Corpus features, misfeatures, and 
missteps 

2.1.1 News mentions as a unit of attention 

Our 116-year study of the news corpus aims to ex- 
tend the rich literature studying topic attention in 
online social media like Twitter, typically over the 
span of the last 3-5 years. Needless to say, 100-year- 
old printed newspapers are an imperfect proxy for 
the attention of individuals, which has only recently 
become directly observable via online behavior. Im- 
plicit in the heart of our study is the assumption that 
news articles are published to serve an audience, and 
the media makes an effort, even if imperfect, to cater 
to the audience's information appetites. We coarsely 
approximate a unit of attention as one occurrence 
in a Google News archive article, and we leave open 
a number of natural extensions to this work, such as 
weighting articles by historical publication subscriber 
counts, or by size and position on the printed page. 

Due to the automated OCR process, not every 
"item" in the corpus can be reasonably declared a 
news article. For example, a single photo caption 
might be extracted as an independent article, or a 
sequence of articles on the same page might be
misinterpreted as a single article. Rather than weighting
each of these corpus items equally when measuring 
the attention paid to a name, we elected to count 
multiple mentions of a name within an item sepa- 
rately, so that articles will tend to count more than 
captions, and there is no harm in mistakenly grouping 
multiple articles as one. 

We manually examined (A) a uniform sample of 50 
articles from the whole corpus (which, per Fig. [1],
contains overwhelmingly articles from the last decade),
and (B) a uniform sample of 50 articles from 1900- 
1925. We classified each sample into: 

• News articles: timely content, formatted as a 
stand-alone "item", published without external 
sponsorship, for the benefit of part of the publi- 
cation's audience, 

• News-like items: non-article text chunks where a 
name mention can qualify as that person being 
"in the news" — e.g. photo captions or inset 
quotes, 

• Non-news: ads and paid content, sports scores, 
recipes, news website comments miscategorized 
as news, etc. 

The number of items of each type in the two samples
is given in the following table.

                   full corpus sample   1900-1925 sample
news articles              31                  28
news-like items             3                   2
non-news items             16                  20

We expect that the similarity in these distributions 
should result in minimal noise in the cross-temporal 
comparisons, and leave to future work the task of 
automatically distinguishing real news stories from 
non-news. 



2.1.2 Compensating for coverage 

Even once we discard the more sparsely covered 18th 
and 19th centuries, there is still more than an order of 
magnitude difference between article volume in 1895 
and 2011. We address these coverage differences by 
downsampling the data to the same number of
articles for each month in this range. We discuss the
nuanced effects of this downsampling on our methodology
in Section [3.3].

2.1.3 Evolution of discourse and media — 
why names? 

We set out originally to understand changes in the 
public's attention as measured by news story topics. 



Figure 2: Articles with recognized personal names per decade

There are myriad heuristics to define a computationally
feasible model of a "single topic" that can
be thought to receive and lose the public's attention. 
But over the course of a century, the changes in so- 
ciety, media formatting, subjects of public discourse, 
writing styles, and even language itself are substan- 
tial enough that neither sophisticated statistical mod- 
els trained on plentiful, well-curated training data 
from modern media nor simple generic approaches 
like word co-occurrence in titles are guaranteed to 
work well. Very few patterns connect articles from 
1910 newspapers' "social" sections (now all but for- 
gotten) about tea at Mrs. Smith's, to 1930 articles 
about the arrival of a trans-oceanic liner, to 2009
articles about a viral YouTube video.

After experiments with general proper noun phrases
produced inconclusively noisy results, we decided to
focus on occurrences of personal names, detected in
the text by a proprietary state-of-the-art statistical 
recognizer. Personal names have a relatively stable 
presence in the media: even with high OCR error 
rates in old microform, over 1/7th of the articles even
in the earliest decades since 1900 contain recognized
personal names (see Figure [2]).

But personal names are not without historical 
caveats, either. A woman appearing in 2005 stories 
as "Jane Smith" would be much more likely to be ex- 
clusively referenced as "Mrs. Smith", or even "Mrs. 
John Smith", in 1915. Also, the English-speaking 
world was much more Anglo-centric in 1900 than now, 
with much less diversity of names. An informal sam- 
ple suggests that most names with non-trivial news 
presence 100 years ago referred overwhelmingly to a 
single bearer of that name for the duration of a particular
news topic, but many names are not unique
when taken across the duration of the whole corpus.
For instance, "John Jacob Astor" appears in the
news heavily over several decades (Fig. [3]), in reference
to a number of distinct relatives. On account
of both of these phenomena, among others, we aim 
to focus on name appearance patterns that are most 
likely to represent a single news story or contiguous 
span of public attention involving that person, rather 
than trying to model the full media "lifetime" of in- 
dividuals, as we had considered doing at the start of 
this project. 

2.1.4 OCR errors in data and metadata 

We empirically discovered another downfall of study- 
ing long-term "media lifetimes" of individuals. In 
an early experiment, we measured, for each personal 
name, the 10th and 90th percentiles of the dates 
of that name's occurrence in the news. We then 
looked at the time interval between 10th and 90th 
percentiles, postulating that a large enough fraction 
of names are unique among newsworthy individuals
that the distribution of these inter-quantile gaps
could be a robust measure of media lifetime. After
noticing a solid fraction of the dataset showing
inter-quantile gaps on the scale of 10-30 years, we examined
a heat map of gap durations, and discovered a regular 
pattern of gap durations at exact-integer year offsets, 
which, other than for Santa Claus, Guy Fawkes, and 
a few other clear exceptions, seemed an improbable 
phenomenon. 

This turned out to be an artifact of OCRed metadata.
In particular, the culprit was single-digit OCR
errors in the scanned article year. While year errors 
are relatively rare, every long-tail name that occurred 
in fewer than 10 articles (often within a day or two 
of each other) and had a mis-OCRed year for one
of those occurrences contributed probability mass to
integral-number-of-years media lifetimes. As extra 
evidence, the heat map had a distinct outlier seg- 
ment of high probability mass for inter-quantile range 
of exactly 20 years, starting in the 1960s and ending 
in the 1980s — the digits 6 and 8 being particularly 
easy to mistake on blurry microfilm. Note that short- 
term phenomena are relatively safe from OCR date 
errors, thanks to the common English convention of 
written-out month names, and to the low impact of 
OCR errors in the day number. 
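
The artifact described above can be reproduced on a toy timeline. The following is a minimal sketch only: the function name `interquantile_gap_days` and the nearest-rank quantile rule are assumptions made for illustration, not the procedure used in the study.

```python
from datetime import date


def interquantile_gap_days(dates, lo=0.10, hi=0.90):
    """Days between the lo and hi quantiles of a list of dates.

    Uses a simple nearest-rank rule (an assumption made for this sketch).
    """
    ordered = sorted(dates)
    n = len(ordered)
    lo_idx = min(n - 1, int(lo * n))
    hi_idx = min(n - 1, int(hi * n))
    return (ordered[hi_idx] - ordered[lo_idx]).days


# A long-tail name mentioned five times within two days in 1968.
clean = [date(1968, 3, 1), date(1968, 3, 1), date(1968, 3, 2),
         date(1968, 3, 2), date(1968, 3, 2)]
# The same mentions, but one article's year is misread as 1988 (6 -> 8).
corrupted = clean[:-1] + [date(1988, 3, 2)]

clean_gap = interquantile_gap_days(clean)
corrupted_gap = interquantile_gap_days(corrupted)
print(clean_gap, corrupted_gap // 365)
```

With so few mentions, a single misread year digit is enough to move the upper quantile, so the measured "lifetime" jumps from about a day to almost exactly a whole number of years.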

OCR errors in the article text itself are ubiquitous. 
Conveniently, the edit distance between two recogniz- 
able personal names is rarely very short, so by agree- 
ing to discard any name that occurs only once in the 
corpus, we are likely to discard virtually all OCR er- 
rors as well, with no impact on data on substantially 
newsworthy people. We should note that OCR er- 
rors are noticeably more frequent on older microfilm, 
but the reasonable availability of recognizable personal
names even in 100-year-old articles, per Fig. [2],
suggests that this problem is not dire. A manually-coded
sample of 50 articles with recognized names
from the first decade of the 1900s showed only 8 out 
of 50 articles having incorrectly recognized names (in- 
cluding both OCR errors and non-names mis-tagged 
as names). 

2.1.5 Simultaneity and publishing cycles 

There are also pitfalls with examining short time- 
lines. In the earliest decades we examine, telegraph 
was widely available to news publishers, but not fully 
ubiquitous, with rural papers often reporting news 
"from the wire" several days after the event. An in- 
formal sample seems to suggest that most news by 
1900 propagated across the world on the scale of a 
few days. Also, many publications in the corpus un- 
til the last 20 years or so were either published ex- 
clusively weekly or, in the case of Sunday newspaper 
issues, had substantially higher volume once a week, 
resulting in many otherwise obscure names having 
multiple news mentions separated by one week — a 
rather different phenomenon than a person remain- 
ing in the daily news for a full week. On account of 
both of these, we generally disregard news patterns 
that are shorter than a few days in our study design. 

3 Measuring Fame 

We begin by producing a list of names for each arti- 
cle. To do this, we extract short capitalized phrases 
from the body text of each article, and keep phrases 
recognized by an algorithm to be personal names. 

For every name that appears in the input, we con- 
sider that name's timeline, which is the multiset of 
dates at which that name appears, including mul- 
tiple occurrences within an article. We intend the 
timeline to approximate the frequency with which a 
person browsing the news at random on a given day 
would encounter that name. The accuracy of this 
approximation will depend on the volume of news ar- 
ticles available. In order to avoid the possibility that 
any trends we detect are caused by variations in this 
accuracy caused by variations in the volume of the 
corpus, we randomly choose an approximately equal 
number of articles to work with from each month. We 
describe and analyze this process in Section [3.3].

In general, our method can be applied to any collection
of timelines. In Section [5] we apply it to names
extracted from blog posts. 
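
As an illustration of the timeline definition above, a timeline can be held as a multiset keyed by date. This is a sketch under stated assumptions: the input format (pairs of publication date and recognized names) and the function name `build_timelines` are hypothetical, and name recognition is taken as already done.

```python
from collections import Counter
from datetime import date


def build_timelines(articles):
    """Map each name to a Counter of date -> number of mentions.

    `articles` is an iterable of (publication_date, names) pairs, where
    `names` lists every recognized mention, repeats included.
    """
    timelines = {}
    for pub_date, names in articles:
        for name in names:
            timelines.setdefault(name, Counter())[pub_date] += 1
    return timelines


# Hypothetical input: two items, one mentioning the same name twice.
articles = [
    (date(1912, 4, 16), ["John Jacob Astor", "John Jacob Astor", "Mrs. Smith"]),
    (date(1912, 4, 17), ["John Jacob Astor"]),
]
timelines = build_timelines(articles)
```

Counting repeats within an item means a long article weighs more than a caption, which matches the choice made in Section 2.1.1.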

3.1 Finding Periods of Fame 

Once we have computed a timeline for each name that 
appears in the corpus, we select a time during which 
we consider that name to have had its period of fame, 
using one of the two methods described below. In 
order to compare the phenomenon of fame at different 
points in time, we consider the joint distribution of 
two variables over the set of names: the peak date and 
the duration of the name's period of fame. We try the 
following two methods to compute a peak date and 
duration for each timeline. 

• Spike method. This method intends to capture 
the spike in public attention surrounding a par- 
ticular news story. We divide time into one-week 
intervals and consider the name's rate of occur- 
rence in each interval. The week with the highest 
rate is considered to be the peak date, and the 
period extends backward and forward in time as 
long as the rate does not drop below one tenth 
its maximum rate. Yang and Leskovec [13] used
a similar method in their study of digital media, 
using a time scale of hours where we use weeks. 

• Continuity method. This method intends to 
measure the duration of public interest in a per- 
son. We define a name's period of popularity to 
be the longest span of time within which there is 
no seven-day period during which it is not men- 
tioned. The peak date falls halfway between the 
beginning and the end of the period. We find, in 
Section [4], that durations are short compared to
the time span of the study, so using any choice 
of peak date between the beginning and end will 
produce similar distributions. 
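
The two methods above can be sketched as follows. This is a hedged illustration rather than the code used in the study: the function names are hypothetical, weeks are counted from the first mention (the text does not fix the alignment), raw weekly counts stand in for rates, and "no seven-day period without a mention" is read as "consecutive mention dates at most seven days apart".

```python
from collections import Counter
from datetime import date, timedelta


def spike_period(timeline, threshold=0.1):
    """Spike method: return (start, end_exclusive) around the peak week.

    `timeline` maps date -> number of mentions.
    """
    origin = min(timeline)
    weekly = Counter()
    for day, count in timeline.items():
        weekly[(day - origin).days // 7] += count
    # Highest-volume week; ties go to the earliest week.
    peak = max(weekly, key=lambda w: (weekly[w], -w))
    cutoff = threshold * weekly[peak]
    first = last = peak
    while weekly.get(first - 1, 0) >= cutoff:
        first -= 1
    while weekly.get(last + 1, 0) >= cutoff:
        last += 1
    start = origin + timedelta(days=7 * first)
    end = origin + timedelta(days=7 * (last + 1))
    return start, end


def continuity_period(timeline, max_gap_days=7):
    """Continuity method: return (start, end, peak) of the longest run."""
    days = sorted(timeline)
    best_start = best_end = run_start = prev = days[0]
    for day in days[1:]:
        if (day - prev).days > max_gap_days:
            run_start = day  # the run is broken; start a new one
        if day - run_start > best_end - best_start:
            best_start, best_end = run_start, day
        prev = day
    peak = best_start + timedelta(days=(best_end - best_start).days // 2)
    return best_start, best_end, peak


# Toy timeline: two weeks of steady mentions, then a sharp two-day spike.
timeline = Counter({
    date(2009, 1, 1): 1, date(2009, 1, 3): 1,
    date(2009, 1, 8): 1, date(2009, 1, 14): 1,
    date(2009, 3, 1): 20, date(2009, 3, 2): 5,
})
spike = spike_period(timeline)
continuity = continuity_period(timeline)
```

On this toy input the spike method selects the single week containing the burst in March, while the continuity method selects the quieter but longer run in January, mirroring the distinction drawn in the text.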

To demonstrate the distinction between these two 
methods, Figure [3] shows the occurrence timeline for 
Marilyn Monroe. The "continuity method" picks out 
the bulk of her fame — 1952-02-13 ("A") through 
1961-11-15 ("D"), by which point her appearance in 
the news was reduced to a fairly low background level. 
The "spike method" picks out the intense spike in in- 
terest surrounding her death, yielding the range 1962- 
7-18 ("E") - 1962-8-29 ("H"). 

Figure 3: Timelines for "Marilyn Monroe" (top) and
"John Jacob Astor" (bottom).

Very often these two methods identify short moments
of fame within a much longer context. For example,
in Figure [3] we see the timeline for the name
"John Jacob Astor", normalized by article counts.
The spike method identifies as the peak the death 
of John Jacob Astor III of the wealthy Astor family, 
with a duration of 38 days (March 8 to February 15, 
1890). The continuity method identifies instead the 
death of his nephew John Jacob Astor IV, who died 
on the Titanic, with a period of five months [12]. The 
period begins on March 23, 1912, three weeks before 
the Titanic sank, and ends August 31. Many of the 
later occurrences of the name are historical mentions 
of the sinking of the Titanic. 

3.2 Choosing the Set of Names 

Basic filtering. In all our experiments, to reduce
noise, we discard the names which occurred fewer than
ten times, or whose fame durations are less than two 
days. (In both methods, a name whose fame begins 
Monday and ends Wednesday is considered to have 
a duration of two days.) We also remove peaks that 
end in 2011 or later, since these peaks might extend 
further if our news corpus extended further in the 
future. 

Top 1000 by year. For each peak type, we repeat
our experiment with the set of names restricted in the 
following way. We counted the total number of times 
each name appeared in each year (counting repeats 
within an article). For each year, we produced the 
set of the 1000 most frequently mentioned names in 
that year. We took the union of these sets over all 
years, and ran our experiments using only the names 
in this set. Note that a name's peak of popularity 
need not be the same year in which that name was in
the top 1000: so if a name is included in the top-1000 
set because it was popular in a certain year, we may 
yet consider that name's peak date to be a different 
year. 

Top 0.1% by year. We consider that filtering to
the top 1000 names in each year might introduce the 
following undesirable bias. Suppose names are as- 
signed peak durations according to some universal 
distribution, and later years have more names, per- 
haps because of the increasing volume of news. If a 
name's frequency of occurrence is proportional to its 
duration, then selecting the top 1000 names in each 
year will tend to produce names with longer dura- 
tions of fame in years with a greater number of names. 
With this in mind, we considered one more restriction 
on the set of names. In each year $y$, we considered
the total number of distinct names $n_y$ mentioned in
that year. We then collected the top $n_y/1000$ names
in each year $y$. We ran our experiments using only
the names in the union of those sets. As with the 
top-1000 filtering, a name's peak date will not nec- 
essarily be the same year for which it was in the top 
0.1% of names. 
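
The two per-year restrictions described in this section can be sketched as below. The function names and the input layout (a mapping from year to per-name mention counts) are assumptions for illustration; rounding $n_y/1000$ down is also an assumption, since the text does not say how fractions are handled.

```python
from collections import Counter


def top_k_by_year(mentions_by_year, k=1000):
    """Union over years of the k most-mentioned names in each year."""
    selected = set()
    for counts in mentions_by_year.values():
        selected.update(name for name, _ in counts.most_common(k))
    return selected


def top_fraction_by_year(mentions_by_year, fraction=0.001):
    """Union over years of the top fraction of distinct names in each year."""
    selected = set()
    for counts in mentions_by_year.values():
        k = int(len(counts) * fraction)  # n_y / 1000, rounded down
        selected.update(name for name, _ in counts.most_common(k))
    return selected


# Hypothetical data: 3000 distinct names in one year, 5000 in another.
mentions_by_year = {
    1950: Counter({f"name{i}": i + 1 for i in range(3000)}),
    2000: Counter({f"name{i}": i + 1 for i in range(5000)}),
}
top_fraction = top_fraction_by_year(mentions_by_year)
```

Because the cut-off scales with the number of distinct names in the year, a larger corpus does not by itself push longer-lived names into the selected set.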

3.3 Sampling for Uniform Coverage 

The spike and continuity methods for identifying pe- 
riods of fame may be affected by the volume of arti- 
cles available in our corpus. For example, suppose a 
name's timeline is generated stochastically, with ev- 
ery article between February 1 and March 31 contain- 
ing the name with a 1% probability. If the corpus 
contains 10000 articles in every week, then both the 
spike and continuity methods will probably decide 
that the name's duration is two months. However,
if the corpus contains fewer than 100 articles in each
week, then the durations will tend to be short, since 
there will be many weeks during which the name is 
not mentioned. 

We propose a model for this effect. Each name
$v$ has a "true" timeline which assigns to each day
$t$ a probability $f_v(t) \in [0,1]$ that an article on that
day will mention $v$.$^3$ For each day, there is a total
number of articles $n_t$; we have no knowledge of the
relation between $n_t$ and $v$, except that there is some
lower bound $n_t > n_{\min}$ for all $t$ within some reasonable
range of time. Then we suppose the timeline for
name $v$ is a sequence of independent random variables
$X_{v,t} \sim \mathrm{Binom}(f_v(t), n_t)$. Our goal is to ensure
that any measurements we take are independent of
the values $n_t$.

$^3$ In fact, articles could mention the name multiple times,
but in the limit of a large number of articles, this will not affect
our analysis.

To accomplish this independence of news volume,
we randomly sampled news articles so that the expected
number in each month was $n_{\min}$. Let $X'_{v,t}$
be the number of sampled articles containing name
$v$. If we were to randomly sample $n_{\min}$ articles
without replacement, then we would have
$X'_{v,t} \sim \mathrm{Binom}(f_v(t), n_{\min})$. Notice that the joint
distribution of the random variables $X'_{v,t}$ is unaffected by the
article volumes $n_t$. Any further measurement based
on the variables $X'_{v,t}$ will therefore also be unrelated
to the sequence $n_t$. In practice, instead of sampling
exactly $n_{\min}$ articles without replacement, we flipped
a biased coin for each of the $n_t$ articles at time $t$,
including each article with probability $n_{\min}/n_t$. For
a large enough volume of articles, the resulting
measurements will be the same.
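
The coin-flip sampling described above can be sketched as follows. The function name `downsample`, the dictionary layout, and the fixed seed are assumptions for illustration; only the inclusion probability (target volume divided by actual volume) comes from the text.

```python
import random


def downsample(articles_by_month, n_min, seed=0):
    """Keep each article independently with probability n_min / n_t."""
    rng = random.Random(seed)
    sampled = {}
    for month, articles in articles_by_month.items():
        if not articles:
            sampled[month] = []
            continue
        keep_prob = min(1.0, n_min / len(articles))
        sampled[month] = [a for a in articles if rng.random() < keep_prob]
    return sampled


# Hypothetical corpus: a sparse early month and a dense recent month.
articles_by_month = {
    "1895-01": list(range(1000)),
    "2009-01": list(range(100000)),
}
sampled = downsample(articles_by_month, n_min=1000)
sizes = {month: len(kept) for month, kept in sampled.items()}
```

The sampled volume of a dense month is only equal to the target in expectation, which is why the text notes that the measurements agree with exact sampling only for a large enough volume of articles.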

We removed all articles published before 1895, 
since the months before 1895 had fewer than our target
number $n_{\min}$ of articles. We also removed articles
published after the end of the year 2010, to avoid hav- 
ing a month with news articles at the beginning but 
not the end of the month, but with the same number 
of sampled articles. 

As an example of the effect of downsampling, the 
blue dotted lines in Figure [9] show the 50th, 90th 
and 99th percentiles of the distribution of fame du- 
rations using the continuity method. We see that 
they increase suddenly in the last ten years, when our 
coverage of articles surges with the digital age. The 
red lines show the same measurement after downsam- 
pling: the surge no longer appears. 

3.4 Graphing the Distributions 

We graph the joint distribution of peak dates and 
durations in two different ways. We consider the set 
of names which peak in successive five-year periods. 
Among each set of names, we graph the 50th, 90th 
and 99th percentile durations of fame. These appear 
as darker lines in the graphs; for example, the top of 
Fig. [6] shows the distribution for the spike method. 
The lighter solid red lines show the same three quan- 
tiles for shorter three-month periods. For compari- 
son, the dashed light blue lines show the same results 
if the article sampling described in Sec. [3.3] is not
performed (and articles before 1895 and after 2010 are
not removed). Fig. [5] shows the same set of lines us- 
ing the continuity method. All the later figures are 
produced in the same way, except they do not include
the non-sampled full distributions. 

The second type of graph focuses on one five-year 
period at a time. The bottom of Fig. [5] shows a cu- 
mulative plot showing the number of names with du- 
ration greater than that shown on the x-axis. This 
is plotted for many five-year periods. The graphs of 
measurements using the spike method look more like 
step functions because that method measures durations
in seven-day increments, whereas the continuity
method can yield any number of days. (Recall
that peaks that last less than two days are removed.) 
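
The cumulative plot described above counts, for each duration on the x-axis, the names whose fame lasted longer. A minimal sketch follows, with the function name `ccdf_counts` chosen for illustration.

```python
from bisect import bisect_right


def ccdf_counts(durations):
    """Return (x, number of durations strictly greater than x) pairs."""
    ordered = sorted(durations)
    n = len(ordered)
    return [(x, n - bisect_right(ordered, x)) for x in sorted(set(ordered))]


# Durations in days for four hypothetical names measured by the spike method.
points = ccdf_counts([7, 7, 14, 21])
```

Spike-method durations fall on multiples of seven days, so the resulting points form the step pattern mentioned in the text.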

3.5 Estimating Power Law Exponents 

We test the hypothesis that the tail of the distribution
of fame durations follows a power law. For a given
five-year period, we collect all names which peak in
that period, and consider the 20% of the names with the
longest fame durations; that is, we set $d_{\min}$ to be
the 80th percentile of durations, and consider durations
$d > d_{\min}$. Among those 20%, we compute a maximum
likelihood estimate $\hat{\alpha}$ of the power law
exponent $\alpha$, predicting that the probability of a
duration $d > d_{\min}$ is $p(d) \propto d^{\alpha}$.
Following Clauset et al. [3], and writing the exponent
with its negative sign as above, the maximum likelihood
estimate over the $n$ durations $d_i > d_{\min}$ is

$$\hat{\alpha} = -1 - n \left( \sum_{i=1}^{n} \ln \frac{d_i}{d_{\min}} \right)^{-1}.$$

We include a line on each plot of cumulative
distributions of fame durations, of slope
$\hat{\alpha} + 1$ on the log-log graph, because we plot
cumulative distributions rather than density functions.
The $\hat{\alpha}$ values we measure are discussed in the
following sections, and summarized in Figure 4 for the
news corpus and Figure 5 for the blog corpus.
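
The tail fit described in this section can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name, the nearest-rank choice of the 80th percentile and the toy data are illustrative, and the exponent is returned with a negative sign to match the values reported in the tables.

```python
import math

def tail_power_law_exponent(durations, tail_percent=20):
    """Maximum likelihood power-law exponent for the longest durations.

    d_min is the (100 - tail_percent)th percentile (nearest rank, an
    assumption), and only durations d > d_min enter the estimate.
    Returns (alpha_hat, d_min, n) with alpha_hat negative.
    """
    ordered = sorted(durations)
    cut = math.ceil(len(ordered) * (100 - tail_percent) / 100)
    d_min = ordered[cut - 1]
    tail = [d for d in ordered if d > d_min]
    if not tail:
        raise ValueError("no durations above d_min; cannot fit a tail")
    log_sum = sum(math.log(d / d_min) for d in tail)
    alpha_hat = -(1 + len(tail) / log_sum)
    return alpha_hat, d_min, len(tail)

# Hypothetical durations in days, for illustration only.
toy = [7, 7, 7, 7, 7, 7, 7, 14, 28, 56]
alpha_hat, d_min, n = tail_power_law_exponent(toy)
ccdf_slope = alpha_hat + 1  # slope of the cumulative plot on log-log axes
```

Ties at the cut-off, which are common for week-discretized spike durations, can leave fewer than 20% of names strictly above `d_min`; how such ties are handled is not specified in the text.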

3.6 Statistical Measurements 

We used bootstrapping to estimate the uncertainty in
the four statistics we measured: the 50th, 90th and
99th percentile durations and the best-fit power law
exponent. For selected five-year periods, we sampled
|S| names with replacement from the set S of names
that peaked in that period of time. For each
statistic, we repeated this process 25000 times, and
reported the range of numbers within which 99% of
our samples fell. The results are presented in Figures
4 (for the news corpus) and 5 (for the blog corpus).
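
The bootstrap procedure can be sketched as below. This is a minimal sketch: the function name, the fixed seed, and the reading of "the range within which 99% of our samples fell" as the central 99% of the resampled statistics are assumptions.

```python
import random
import statistics

def bootstrap_interval(values, statistic, resamples=25000, coverage=0.99, seed=0):
    """Resample len(values) items with replacement, `resamples` times,
    and return the central `coverage` range of the resulting statistics."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(resamples)
    )
    tail = (1 - coverage) / 2
    low = estimates[int(tail * resamples)]
    high = estimates[min(resamples - 1, int((1 - tail) * resamples))]
    return low, high

# Hypothetical durations: most names peak for 7 days, a few for 14.
durations = [7] * 50 + [14] * 5
median_interval = bootstrap_interval(durations, statistics.median, resamples=2000)
```

Any of the four statistics can be passed as `statistic`, for example a percentile function or a power-law exponent estimator applied to the resampled durations.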



4 Results: News Corpus 

We measure periods of popularity using the spike
and continuity methods described in Section 3, and
in each case plot the distribution of duration as it
changes over time.

Figure 6: Fame durations measured using the spike
method, plotted as the 50th, 90th and 99th
percentiles over time (top) and for specific five-year
periods (bottom). The bottom graph also includes a
line showing the max-likelihood power law exponent
for the years 2005-9. (The slope on the graph is one
plus the exponent from Fig. 4, since we graph the
cumulative distribution function.) To illustrate the
effect of sampling for uniform article volume, the first
graph includes measurements taken before sampling;
see Sec. 3.3. Section 3.4 describes the format of the
graphs in detail.

Figures 6 and 9 show the evolution of the
distribution of fame durations for the full set of names
in the corpus (after the basic filtering described in
Section 3.2) using the spike and continuity methods,
respectively. (Section 3.4 describes the format of the
graphs in detail.)

Median durations. For the entire period we studied,
the median fame duration did not decrease, as we
had expected, but rather remained constant at exactly
7 days, for both the spike and the continuity peak
measurement methods. For the spike method alone, this
would not have been surprising. Peaks measured by the
spike method are discretized to multiples of weeks, so
a perennial median of 7 days just shows that multi-week
durations have never been common. On the other hand,
the continuity method freely admits fame durations in
increments of 1 day, with only 1-day-long peaks
filtered out. Yet the median has remained at exactly
7 days for all the years studied and, per the
full-corpus "50th percentile" measurements shown in
blue in Figure 4, for all decades where we tried
bootstrapping, 99% of bootstrapped samples also matched
the 7-day measurement exactly (for the continuity
method and, less surprisingly, for the spike method).
This gives strong statistical significance to the claim
that 7 days is indeed a very robust measurement of
typical fame duration, which has not varied in a
century.

Figure 7: Fame durations, restricting to the union of
the 1000 most-mentioned names in every year, using
the spike method to identify periods of fame.

Figure 8: Fame durations, restricting to the union of
the 0.1% most-mentioned names in every year,
measured using the spike method.

The most famous. We next consider specially the
fame durations of the most famous names, in two
correlated but distinct senses of "most famous":

• "Duration outliers": people whose fame lasts
much longer than typical, as measured by
the 90th and 99th percentiles of fame durations
within each year. These correspond to the top
two lines in the timelines of Figures 6 and 9, and
the columns "90 %ile" and "99 %ile" of the first
and fourth blocks of Figure 4.

• "Volume outliers": the names which appear
the most frequently in the news, by being
either in the top 1000 most frequent names in some
year, or, separately, names in the top 0.1%, as
per Section 3.2. The graphs for these subsets of
names are shown in Figures 7 and 8 for the spike
method, and Figures 10 and 11 for the continuity
method, and the statistical measurements
appear in blocks 2, 3, 5 and 6 of Figure 4.
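
The "volume outlier" subsets are unions of per-year top lists. A minimal sketch of that selection is given below; the function names, the `{year: {name: mention_count}}` input format and the rounding rule for the fractional cut-off are assumptions, since the precise procedure is the one given in Section 3.2.

```python
from collections import Counter

def top_names(counts, size):
    """The `size` most-mentioned names for one year."""
    return {name for name, _ in Counter(counts).most_common(size)}

def volume_outliers(mentions_by_year, k=None, fraction=None):
    """Union over years of each year's top-k names, or of each year's
    top `fraction` of names (at least one name per year; the rounding
    rule is an assumption)."""
    selected = set()
    for counts in mentions_by_year.values():
        size = k if k is not None else max(1, int(len(counts) * fraction))
        selected |= top_names(counts, size)
    return selected

# Hypothetical mention counts, for illustration only.
toy = {
    1950: {"A": 30, "B": 5, "C": 2},
    1951: {"A": 1, "B": 4, "D": 9},
}
top1 = volume_outliers(toy, k=1)
top2 = volume_outliers(toy, k=2)
```

In the setting above, `k=1000` would correspond to the top-1000 subset and `fraction=0.001` to the top 0.1% subset.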

From the 1900's to the 1940's, the fame durations
in both categories of outliers do tend to decrease, with
the decreases across that time interval statistically
significantly lower-bounded by 1-2 weeks via 99%
bootstrapping intervals. Heuristically, this seems
consistent with our original hypothesis that accelerating
communications shorten fame durations: 1-2 weeks
is a reasonable delay to be incurred by sheer
communications delay before the omnipresence of telegraphy
and telephony. We note with curiosity that this
effect applies only to the highly famous outliers rather
than the typical fame durations. We posit that this is
perhaps due to median fame durations being typically
attributable to people with only geographically
localized fame, which does not get affected by long
communication delays. We leave to further work a more
nuanced study to test these hypotheses around
locality and communication delays affecting news spread
in the early 20th century.

Figure 9: Fame durations measured using the
continuity method, plotted as the 50th, 90th and 99th
percentiles over time (top), and for specific five-year
periods (bottom). To illustrate the effect of sampling,
the first graph includes measurements taken before
sampling; see Section 3.3. Section 3.4 describes the
format of the graphs in detail.

Figure 10: Fame durations, restricting to the union
of the 1000 most-mentioned names in every year,
measured using the continuity method.

After the 1940's, on the other hand, we see no such
decrease. On the contrary, the durations of fame for
both the duration outliers and the volume outliers
reverse the trend, and actually begin to slowly increase.
Using the bootstrapping method, per Section 3.6, we
get the results marked in red in Figure 4: in almost
all of the outlier studies (footnote 4), we see that the
increase in durations is statistically significant over
40-year gaps for both categories of fame outliers. For
example, the median fame duration according to
continuity peaks for the top 1000 names (50th percentile
column of the fifth block) appears as "27 (25 .. 29)" in
the period 1945-9 and "52 (49 .. 56)" for the period
1985-9: with 99% confidence, the median duration was
less than 29 days in the former period, but greater than
49 days in the latter.

Footnote 4: 7 out of the 8 outlier studies show
statistically significant increases between the 1940's
and the 1980's, and between the 1960's and the 2000's.
The sole exception is the 90th percentile of the spike
method. Given that the bootstrap values in that
experiment, discretized to whole weeks, range between 3
and 4 weeks, we don't consider it surprising that the
increases there were not measured to be significant by
99% bootstrap intervals.

[Plot axes: peak of fame (top panel, 50th, 90th and 99th
percentiles for 3-month and 5-year groups); duration of
fame in days (bottom panel, five-year buckets from 1905-9
to 2005-9, with the power law slope).]

Figure 11: Fame durations, restricting to the union
of the 0.1% most-mentioned names in every year,
measured using the continuity method.

We also ran experiments for names that have out- 
lier durations within the subset of names with outlier 
volumes. The same general trends were seen there 
as with the above outlier studies, but, with a far 
shallower pool of data, the bootstrapping-based er- 
ror bars were generally large enough to not paint a 
convincing, statistically significant picture. 
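
The significance statements in this section compare pairs of 99% bootstrap intervals. One reading of that comparison, sketched below, is that a change counts as significant when the two intervals do not overlap. The function name and this exact criterion are assumptions; the first two intervals are the ones quoted above for the top 1000 names under the continuity method, and the last pair is hypothetical.

```python
def intervals_separated(interval_a, interval_b):
    """True when two (low, high) bootstrap intervals do not overlap."""
    (low_a, high_a), (low_b, high_b) = interval_a, interval_b
    return high_a < low_b or high_b < low_a

# Intervals quoted in the text: median duration in days.
median_1945_9 = (25, 29)
median_1985_9 = (49, 56)
increase_is_significant = intervals_separated(median_1945_9, median_1985_9)

# Hypothetical wide intervals from a small sample: they overlap,
# so no significant change would be claimed.
small_sample_change = intervals_separated((20, 90), (35, 140))
```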

Power law fits. The column titled "power law
exponent" in Figure 4 shows the maximum likelihood
estimates of the power law exponents for various
five-year-long peak periods. We focus on the first and
fourth blocks, which show the estimates for the full
set of names for the spike method and the continuity
method respectively.

For both peak methods, the fitted power law
exponents remain in fairly small ranges: between -2.77
and -2.45 for continuity peaks, and between -2.63 and
-2.32 for spike peaks. In Figures 6 and 9 we show
the actual distributions and, for reference,
comparisons with the power-law fit for the 2005-2009 data
(a straight line on these log-log plots).

Furthermore, the continuity-peak fits also support
the above observation of slowly growing long-tail
fame durations from 1940 onward. That is, power-law
exponents from 1940 onward slowly move toward
zero, with statistically significant changes when
compared at 40-year intervals. The fluctuations and the
error bars for both methods are rather noticeable,
though, suggesting that power laws make for only a
mediocre fit to this data.



5 Results: Blog Posts 

We also ran our experiments on a second set of data
consisting of public English-language blog posts from
the Blogger service. We began by sampling so that
the number of blog posts in each month in our data
set was equal to the number of news articles we
sampled in each month, as per Sec. 3.3. The cumulative
graphs of fame duration from six experiments
are shown in Fig. 12. We combine the two methods
for identifying periods of fame with the three sets of
names described in Section 3.2. The respective
distributions from the news corpus are superimposed for
comparison.
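
The volume-matching step can be sketched as follows. This is a minimal sketch, not the pipeline used for the paper: the function name, the `(month_key, post_id)` input format, the fixed seed, and the choice to keep every post in a month that has fewer posts than the target are assumptions.

```python
import random
from collections import defaultdict

def downsample_to_monthly_targets(posts, targets, seed=0):
    """posts: iterable of (month_key, post_id).
    targets: {month_key: number of news articles sampled that month}.
    Returns {month_key: list of sampled post_ids}, drawn without replacement."""
    rng = random.Random(seed)
    by_month = defaultdict(list)
    for month, post_id in posts:
        by_month[month].append(post_id)
    sample = {}
    for month, ids in by_month.items():
        k = min(targets.get(month, 0), len(ids))  # cannot draw more than exist
        sample[month] = rng.sample(ids, k)
    return sample

# Hypothetical post ids and monthly targets, for illustration only.
posts = [("2005-01", i) for i in range(10)] + [("2005-02", i) for i in range(10, 13)]
targets = {"2005-01": 4, "2005-02": 5}
sampled = downsample_to_monthly_targets(posts, targets)
```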

The graphs of fame duration measured using the 
continuity method are much smoother for the blog 
corpus than for the news corpus. This happens be- 
cause whereas we only know which day each news 
article was written, we know the time of day each 
blog entry was posted. 

The continuity-method graphs (bottom of Figure 12)
had a distinctive rounded cap which surprised
us at first. We believe it is caused by the following
effect. Peaks with only two mentions in them are fairly
common, and have a simple distinctive distribution
that is the difference between two sample dates
conditioned on being less than a week apart. Since two
dates that are longer than one week apart cannot
constitute a longest-stretch peak, the portion of the
graph with durations longer than one week does not
include any names from this two-sample distribution,
and so it looks different. Our estimates of power-law
exponents only consider the longest 20% of durations,
so they ignore this part of the graph.
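
The two-mention effect can be illustrated with a small simulation. This is a sketch under a strong simplifying assumption that is not taken from the paper: the two mention times are drawn independently and uniformly over a fixed span. The function name and parameters are likewise illustrative.

```python
import random

def two_mention_durations(samples=10000, span_days=365.0, max_gap_days=7.0, seed=0):
    """Durations of two-mention peaks: the gap between two timestamps,
    kept only when the gap is under `max_gap_days`."""
    rng = random.Random(seed)
    durations = []
    while len(durations) < samples:
        first = rng.uniform(0, span_days)
        second = rng.uniform(0, span_days)
        gap = abs(first - second)
        if gap < max_gap_days:  # longer gaps cannot form a single peak
            durations.append(gap)
    return durations

durations = two_mention_durations(samples=500)
longest = max(durations)
```

Every simulated duration falls below one week, so this population contributes to the cumulative graph only to the left of the seven-day mark, consistent with the explanation above.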

The estimates we computed for the power-law
exponents of the duration distributions for blog data
are shown in Figure 5 and can be compared to the
figures for news articles in Figure 4.

The medians for blogs and news are remarkably close
under both methods, with no statistically significant
differences. The power law fits are also quite similar,
although they show enough variation to produce
statistically significant differences. Qualitatively, we
take these as evidence that the fame distributions in
news and blogs are coarsely similar, and that it is not
unreasonable to consider these results as casting some
light on more fundamental aspects of human attention
to and interest in celebrities, rather than just on the
quirks of the news business.

We do leave open the question of accounting for the 
occasionally significant distinctions between outlier 
results for blogs, as compared to news, especially for 
outlier-volume continuity peaks. 



6 Related Work 

Michel et al. [9] study a massive corpus of digitized
content in an attempt to study cultural trends. The
corpus they study is even larger than ours in terms
of both volume and temporal extent.

Leetaru [7] presents evidence that sentiment anal- 
ysis of news articles from the past decade could have 
been used to predict the revolutions in Tunisia, Egypt 
and Libya. 

Our spike method for identifying periods of fame
is motivated in part by the work of Yang and
Leskovec [13] on identifying patterns of temporal
variation on the web. Szabo and Huberman also
consider temporal patterns, in their case regarding
consumption of particular content items. Kleinberg
studies other approaches to identification of bursts [6].

Numerous works have studied the propagation of
topics through online media. Leskovec et al. [8]
develop techniques for tracking short "memes" as they
propagate through online media, as a means to
understanding the news cycle. Adar and Adamic [2], and
Gruhl et al. [5] consider propagation of information
across blogs.

Finally, a range of tools and systems provide access
to personalized news information; see Gabrilovich et
al. [4] and the references therein for pointers.
Figure 4: Percentiles and best-fit power-law exponents for five-year periods of the news corpus. Each
entry shows the estimate based on the corpus, and the 99% bootstrap interval in parentheses, as described
in Section 3.6. Results discussed in Section 4.
Method      Names     Period  50 %ile          90 %ile           99 %ile              Power law exponent
(row labels lost)             28 (25 .. 35)    88 (74 .. 102)    213 (113 .. 1674)    -3.29 (-5.40 .. -2.23)
continuity  all       2000-4  7 (7 .. 7)       22 (20 .. 23)     114 (95 .. 160)      -2.38 (-2.49 .. -2.28)
continuity  all       2005-9  6 (6 .. 7)       18 (17 .. 19)     80 (66 .. 93)        -2.62 (-2.72 .. -2.53)
continuity  top 1000  2000-4  20 (18 .. 21)    71 (59 .. 83)     387 (237 .. 819)     -2.32 (-2.54 .. -2.12)
continuity  top 1000  2005-9  21 (20 .. 22)    59 (53 .. 73)     408 (211 .. 1057)    -2.37 (-2.62 .. -2.18)
continuity  top 0.1%  2000-4  102 (89 .. 123)  372 (236 .. 768)  2010 (768 .. 2238)   -2.24 (-3.15 .. -1.86)
continuity  top 0.1%  2005-9  83 (70 .. 93)    302 (193 .. 617)  2083 (954 .. 2991)   -2.12 (-2.75 .. -1.79)




Figure 5: Percentiles and best-fit power-law exponents for five-year periods of the blog corpus. Each entry
shows the estimate based on the corpus, and the 99% bootstrap interval in parentheses, as described in
Section 3.6. Results discussed in Section 5.
Figure 12: Cumulative duration-of-fame graphs for the blog corpus. The graphs at the top show the spike
method results (for all names, top 1000, and top 0.1%), and those at the bottom show the continuity method
results.